Abstract


This paper describes our proposed system for the AAAI-CAD21
shared task: Predicting Emphasis in Presentation Slides. In
this task, given the contents of a slide, we are asked to
predict the degree of emphasis to be laid on each word in the
slide. We propose two approaches to this problem: a
BiLSTM-ELMo approach and a transformer-based approach built
on the RoBERTa and XLNet architectures. We achieve a score of
0.518 on the evaluation leaderboard, which ranks us 3rd, and
0.543 on the post-evaluation leaderboard, which ranks us 1st
at the time of writing the paper.


Introduction 


Emphasis Selection for written text in visual media from 
crowdsourced label distributions was first proposed by Shi- 
rani et al. (2019) and then by Shirani et al. (2020a) in 
SemEval-2020 Task 10, Emphasis Selection for Written 
Text in Visual Media (Shirani et al. 2020b). AAAI-CAD21 
shared task: Predicting Emphasis in Presentation Slides 
(Shirani et al. 2021) builds on the same SemEval-2020 Task 
10. Presentation slides have become common in workplace
scenarios, and researchers have previously developed
resources that guide presenters on overall style, color, and
font sizes, so that the graphical representation of a slide
makes an impact on the viewer and helps the viewer relate to
and understand the message that the presenter is trying to
convey. This shared task aims at designing automated
approaches to predict which words in a slide should be
emphasized (by making them bold or italic) to improve the
visual appeal of the slide. A pictorial example of what this
shared task aims to achieve can be seen in Figure 1.

To solve this problem we treat this task as a sequence
labelling problem. Given the contents of an entire slide
$d = \{w_1, w_2, \dots, w_n\}$ as the input text, we predict
the emphasis probability for each word in the contents of the
slide, $e = \{e_1, e_2, \dots, e_n\}$.


Figure 1: The left slide is plain text. The right slide shows
the important emphasized words in the slide.


We mainly try two approaches to solve this problem.
In our transformers approach, we experiment with two
different transformer-based model architectures, namely
RoBERTa (Liu et al. 2019) and XLNet (Yang et al. 2019).
Our choice of transformer architectures is inspired by the 
best performing architectures in SemEval-2020 Task 10, 
Emphasis selection for written text in visual media (Sing- 
hal et al. 2020; Anand et al. 2020). Both these models were 
pre-trained on large amounts of unannotated data in an un- 
supervised manner. A particular token w in a slide is first 
passed through these transformer models to obtain embed- 
ding of each word in the form of vector representations, after 
which these vectors are passed through BiLSTM and fully 
connected layers for classification. We also keep the trans- 
former part of the model trainable and fine-tune the weights 
on our downstream task of emphasis prediction. 


In our second approach, we use a BiLSTM + ELMo 
model inspired by the baseline paper (Shirani et al. 2019). 
We modify the baseline model and use character embed- 
dings together with pre-trained ELMo embeddings of each 
word and feed them to BiLSTM + Attention and fully con- 
nected layers for predicting emphasis scores. Additionally, 
we also concatenate some word-level features in the atten- 
tion output before feeding it to the fully connected layers. 
A part of our modification is inspired by the team that stood 
3rd on the SemEval-2020 Task 10 leaderboard (Singhal et al. 
2020). 


Additionally, we employ two approaches to training all
models: Binary Cross Entropy (BCE) Loss to directly predict
emphasis probabilities, and KL Divergence Loss (Kullback and
Leibler 1951), which uses Label Distribution Learning (LDL)
(Geng 2016) to learn the probabilities of both emphasis and
non-emphasis, as used in the baseline paper (Shirani et al.
2019).


Literature Review 


A lot of work in NLP has been done on keyphrase extraction
in long texts from scientific articles or news (Augenstein
et al. 2017; Zhang et al. 2016). Keyword detection mainly
focuses on finding important nouns or noun phrases in the
input text. Emphasis prediction, on the other hand, focuses
on automatically emphasizing the words in the input text that
increase its visual appeal and make it easier for the viewer
to understand the message being conveyed.

Word emphasis prediction has also been explored in spo- 
ken data using acoustic and prosodic features (Mishra, Srid- 
har, and Conkie 2012; Chen and Pan 2017). Emphasis Se- 
lection for written text in visual media was first proposed 
by Shirani et al. (2019) and then by Shirani et al. (2020a) 
as a SemEval-2020 Task. The baseline paper (Shirani et al.
2019) uses end-to-end label distribution learning (LDL) to
predict emphasis scores on short text. The model has an
embedding layer, which is either GloVe (Pennington, Socher,
and Manning 2014) or ELMo (Peters et al. 2018), followed
by BiLSTM + Attention and fully connected layers. They
used the Adobe Spark dataset¹ for their experiments. From
here on, this model will be referred to as the "Baseline"
model in our paper.

Team ERNIE (Huang et al. 2020) from SemEval-2020 Task 10,
who stood 1st on the leaderboard, investigated the
performance of several transformer-based models, including
ERNIE 2.0, XLM-RoBERTa, RoBERTa, and ALBERT, together with a
combination of pointwise regression and pairwise ranking
loss. The authors also tried some augmentation schemes and
word-level lexical features, and reported ERNIE 2.0 with the
addition of lexical features to be the best performing model
on the shared-task dataset.

Team IITK (Singhal et al. 2020), which stood 3rd on the
leaderboard, also explored a number of transformer-based
models, including variations of BERT, RoBERTa, XLNet, and
GPT-2, as well as a modification of the baseline model. Parts
of our modification of the baseline model are based on the
model reported there. Their final results were obtained from
a simple ensemble of a number of their transformer-based
models.

Team MIDAS (Anand et al. 2020), which stood 11th on the
leaderboard, also used BERT, RoBERTa and XLNet together with
either a combination of BiLSTM and Dense layers or just Dense
layers.

Learning from annotations from different annotators has 
been explored with majority voting (Laws, Scheible, and 
Schütze 2011) or by learning individual annotator expertise
(Yang et al. 2018; Rodrigues and Pereira 2017; Rodrigues, 
Pereira, and Ribeiro 2013). Most work on this takes only 
one label sequence as correct. The baseline paper (Shirani 
et al. 2019) was the first work to have used Label Distribu- 
tion Learning (Geng 2016) for a sequence labeling task. We 
also explore this learning scheme in the experiments men- 
tioned in our paper together with Binary Cross Entropy loss 
on probabilities obtained from the dataset annotations. 


¹ https://spark.adobe.com/


Background 
Problem Definition 


Given a sequence of words or tokens $d = \{w_1, w_2, \dots, w_n\}$
in a slide, the task is to compute a probabilistic score
$e_i \in [0, 1]$ for each $w_i$ in $d$, which indicates the
degree of emphasis to be laid on the word.


Evaluation Metric 


The evaluation metric for our problem is defined as follows:


For a given $m \in \{1, 5, 10\}$, we first define two sets:
$S_m^{(x)}$, the set of $m$ words with the top $m$
probabilities according to the ground truth, and
$\hat{S}_m^{(x)}$, the set of $m$ words with the top $m$
probabilities according to the model predictions. To get
$S_m^{(x)}$, each word in the sentence has been manually
annotated by 8 annotators. Based on these two sets, we define
$\mathrm{Match}_m$ as:


$$\mathrm{Match}_m = \frac{\sum_{x \in D_{test}} |S_m^{(x)} \cap \hat{S}_m^{(x)}| / m}{|D_{test}|} \qquad (1)$$


where $D_{test}$ is the dataset and $x$ is the token instance.
We find $\mathrm{Match}_m$ for $m \in \{1, 5, 10\}$ and express
our final score as a simple average over all three of them.
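As a rough illustration of the metric, the following Python sketch computes the top-m overlap per instance and averages it over the dataset. The function names are hypothetical, and two details are assumptions rather than part of the task definition: ties are broken by token position, and each overlap is divided by min(m, number of tokens) so that every per-instance score stays in [0, 1] even for very short instances.

```python
from typing import Sequence


def top_m_indices(scores: Sequence[float], m: int) -> set:
    """Indices of the m highest scores (assumption: ties broken by position)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:m])


def match_m(gold: Sequence[Sequence[float]],
            pred: Sequence[Sequence[float]],
            m: int) -> float:
    """Average top-m overlap between gold and predicted emphasis scores."""
    total = 0.0
    for g, p in zip(gold, pred):
        k = min(m, len(g))  # assumption: guard for instances shorter than m
        total += len(top_m_indices(g, k) & top_m_indices(p, k)) / k
    return total / len(gold)


def average_match(gold, pred, ms=(1, 5, 10)) -> float:
    """Final score: simple average of Match_m over the chosen values of m."""
    return sum(match_m(gold, pred, m) for m in ms) / len(ms)
```

For example, `average_match(gold, pred)` with the default `ms=(1, 5, 10)` mirrors the averaging over the three values of m described above.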


Dataset 


Statistics for the dataset provided in the AAAI-CAD21 shared
task are shown in Table 1. Each training instance is a
complete slide with all the tokens present in the slide.
Additionally, the sentence-wise divisions in the slides are
also provided in the data. The entire training dataset was
annotated by a total of 8 annotators for token-level
emphasis. The dataset was annotated with a BIO tagging
scheme, where each annotator marked each token either as
emphasized (B or I) or not (O). The probability of emphasis
for each token was then calculated as the average over all
annotations. The annotation scheme and the emphasis
probability calculation are shown with an example in
Table 3. More information about the task and data creation
can be found in Shirani et al. (2021).
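The probability calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming each annotator's labels arrive as a list of "B", "I" or "O" tags aligned with the tokens; the function name and input layout are hypothetical and not taken from the shared-task data format.

```python
def emphasis_probabilities(annotations):
    """Per-token emphasis probability from several annotators' BIO tags.

    annotations: list with one tag sequence per annotator; all sequences
    must cover the same tokens in the same order.
    """
    n_tokens = len(annotations[0])
    if any(len(tags) != n_tokens for tags in annotations):
        raise ValueError("all annotators must label the same tokens")

    probabilities = []
    for t in range(n_tokens):
        # A token counts as emphasized when it is tagged B or I.
        emphasized = sum(1 for tags in annotations if tags[t] in ("B", "I"))
        probabilities.append(emphasized / len(annotations))
    return probabilities
```

With 8 annotators, a token tagged B or I by 6 of them would receive a probability of 6 / 8 = 0.75.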


Split | Total Slides | Total Sentences | Total Tokens
Train | 1241 | 8849 | 96934
Development | | 1173 | 12822
Test | | 2569 | 28108


Table 1: Train, Development and Test Dataset Description 








Table 2: Token length description 
Table 3: Annotation scheme and emphasis probability calculation on a sample sentence from the Train dataset.


System Description 
Token Level Features 


We investigated the data to find token-level features that
could enhance the performance of our BiLSTM-ELMo model. For
each candidate feature, we analyzed the average emphasis
score of the tokens carrying it and the number of times such
tokens occurred in our dataset. These average emphasis scores
and total counts can be found in Table 4. Initially, we tried
only shape and syntactic features of words, concatenating
them with the attention output as described in our system
description. The only feature that gave us an improvement
over the baseline score was POS (Parts of Speech) tags.


Table 4: Average Emphasis Scores and Count


Train (Avg/Nos.) | Dev (Avg/Nos.) 





Figure 2: Word Cloud of tokens having emphasis probability
of > 0.5


Upon analysis of words with an emphasis score > 0.5, we
noticed that most of them were scientific keywords. We
therefore created our own feature by training a sequence
labeling BiLSTM-CRF model, with BERT (Peters et al. 2018a)
word embeddings as input, on the information extraction
datasets used for scientific keyphrase extraction by
Sahrawat et al. (2019). We use the Python flair² library for
this task. The results of the model trained for this task are
given in Table 5. The model was trained and run at inference
with a BIO tagging scheme, and its output was converted to a
binary feature where "B" and "I" tags were mapped to 1 and
"O" to 0. This feature, when used together with POS tags,
gives us a decent improvement on the baseline results. We
name this feature the "Keyphrase Feature" in all our
experiments.


² https://github.com/flairNLP/flair


Table 5: Keyphrase Extraction Model Results 
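To make the token-level features concrete, the sketch below turns keyphrase BIO tags into the binary Keyphrase Feature and appends it to a POS encoding. The one-hot POS encoding, the small tag inventory `POS_TAGS` and the function names are assumptions made for illustration; the paper does not specify how the POS tags are encoded.

```python
# Hypothetical, deliberately small POS inventory; unknown tags map to "OTHER".
POS_TAGS = ["NOUN", "ADJ", "VERB", "ADV", "OTHER"]


def keyphrase_bits(bio_tags):
    """Map keyphrase tags to the binary feature: B and I become 1, O becomes 0."""
    return [1 if tag.startswith(("B", "I")) else 0 for tag in bio_tags]


def token_features(pos_tags, bio_tags):
    """Per-token feature vector: one-hot POS tag followed by the keyphrase bit."""
    features = []
    for pos, bit in zip(pos_tags, keyphrase_bits(bio_tags)):
        one_hot = [1 if pos == known else 0 for known in POS_TAGS]
        if sum(one_hot) == 0:
            one_hot[-1] = 1  # tag outside the inventory falls back to OTHER
        features.append(one_hot + [bit])
    return features
```

Each resulting vector could then be concatenated with the attention output for the corresponding token.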


Our Approach 


BiLSTM-ELMo Approach Our BiLSTM-ELMo approach is inspired
by the baseline paper (Shirani et al. 2019). We extract the
ELMo embeddings (Peters et al. 2018b) $E_{ELMo}$ for each
word in a sequence. Additionally, we pass the input through a
character-level BiLSTM network, where the combined forward
and backward embedding for the last character of each word is
passed through a Highway Layer (Singhal et al. 2020), which
effectively provides us with contextual word-level embeddings
$E_{hway}$ for our entire sequence. These contextual
word-level embeddings are then concatenated with the
extracted ELMo embeddings for each word to produce the final
word embeddings $E$.

We pass $E$ through a BiLSTM Layer followed by an Attention
Layer. The output of the attention layer is then concatenated
with the POS tags (Singhal et al. 2020) and the Keyphrase
Feature for the corresponding word at each time-step. The
combined attention output and external features are fed to a
Time Distributed Dense Layer followed by our Time Distributed
Output Layer. The activation function $g$ of the Output Layer
is either Sigmoid or Softmax, depending on whether the loss
criterion is Binary Cross-Entropy Loss (Sigmoid activation)
or Kullback-Leibler Divergence Loss (Softmax activation), the
latter being used for Label Distribution Learning (LDL).

Figure 3: The BiLSTM-ELMo Model


Transformers Approach Our Transformers approach makes use
of one of two Transformer architectures, that is, XLNet or
RoBERTa. First, the tokenized word input is passed through
the transformer architecture, and the outputs of all encoding
layers of the transformer are concatenated together to get
the final embedding $E$ for any given word. This embedding
$E$ is then fed through a BiLSTM Layer followed by a set of
Time Distributed Dense Layers. Finally, the output of the
Time Distributed Dense Layers is fed to our Time Distributed
Output Layer with activation function $g$. Here, $g$ is
either Sigmoid or Softmax when the loss function is Binary
Cross-Entropy or KL-Divergence, respectively.


Experimental Setup 


We use the PyTorch³ framework for our deep learning models,
along with the Transformer implementations, pre-trained
models, and model-specific tokenizers in the HuggingFace
library⁴.


³ https://pytorch.org/
⁴ https://huggingface.co/transformers/


In the BiLSTM-ELMo approach, we use a hidden size of 300 for
the character-level LSTM layers, and on top of that we use
one highway layer, which gives us word-level embeddings
$E_{hway}$. These embeddings are then concatenated with their
corresponding ELMo embeddings $E_{ELMo}$, where the
embeddings have 2048 dimensions. This concatenated vector is
passed through a BiLSTM Layer with an output size of 512
dimensions in each direction. The Attention Layer uses a
self-attention mechanism; its output is concatenated with the
POS tags and Keyphrase Feature for each word so that this
information can be used by the classifier to make better
predictions. The final stage of the classifier consists of a
Time Distributed Dense Layer with a hidden size of 20 and
ReLU activation. Finally, the output layer has 1 output
neuron if the activation function is Sigmoid and the loss
function is Binary Cross-Entropy, and 2 output neurons in the
case of a Softmax activation function and KL-Divergence loss
function. The dropout probabilities were set to 0.3 for all
layers to avoid overfitting.


In the transformers approach, we used the RoBERTa and XLNet
transformers without freezing any layers of the network, and
the outputs of all encoder layers are concatenated to make
word-level embeddings $E$. These word embeddings are then
passed through two BiLSTM Layers with an output size of 256
dimensions in each direction. The output is then fed to a
pair of BiLSTM Layers with 256-dimensional output in both
directions. In the classifier, the output of the BiLSTM is
fed to a pair of Time Distributed Dense Layers with a hidden
layer size of 20 and ReLU activation, and finally to the Time
Distributed Output Layer, which has either 1 or 2 output
neurons depending on whether the activation function used is
Sigmoid or Softmax, respectively. Dropout layers with a
dropout probability of 0.3 are also added to prevent
overfitting.


When using Sigmoid activation, we aim to predict a single
output $\hat{y}^{<t>}$, which represents the probability of
emphasis to be laid on the $t^{th}$ token. This probability
is used with the Binary Cross-Entropy Loss to train the
model. However, in the case of Softmax, we predict a
probability distribution $\hat{Y}^{<t>}$ over 2 classes
{0 = no emphasis, 1 = emphasis}. This distribution is used
with the KL-Divergence Loss function to perform Label
Distribution Learning. The equations for $\hat{y}^{<t>}$ and
$\hat{Y}^{<t>}$ are as follows:


$$\hat{y}^{<t>} = \mathrm{Sigmoid}(z^{<t>}) \qquad (2)$$

where $z^{<t>}$ is the logit of the last output layer for the
$t^{th}$ token.


$$\hat{Y}^{<t>} = \{\mathrm{Softmax}(z_0^{<t>}), \mathrm{Softmax}(z_1^{<t>})\} \qquad (3)$$

where $z_i^{<t>}$ is the logit of the last output layer for
the $t^{th}$ token and $i^{th}$ class.


Figure 4: The Transformer-Based Model

In both the Transformers and BiLSTM-ELMo approaches, the
Binary Cross-Entropy (BCE) Loss as well as the KL-Divergence
(KLD) Loss were used to train the models. The
$\mathrm{Match}_m$ score is used as the evaluation metric for
all our models. The equations for both loss functions are as
follows:


$$\mathrm{BCE}(y^{<t>}, \hat{y}^{<t>}) = -y^{<t>} \log(\hat{y}^{<t>}) - (1 - y^{<t>}) \log(1 - \hat{y}^{<t>}) \qquad (4)$$

where $y^{<t>} \in \{0, 1\}$ is the true label for the
emphasis laid on each token and $\hat{y}^{<t>}$ is the output
of the sigmoid activation for each token.


$$\mathrm{KLD}(Y^{<t>} \,\|\, \hat{Y}^{<t>}) = \sum_{i} Y_i^{<t>} \log\left(\frac{Y_i^{<t>}}{\hat{Y}_i^{<t>}}\right) \qquad (5)$$

where $Y^{<t>}$ is the true probability distribution for the
emphasis laid on each token and $\hat{Y}^{<t>}$ is the output
distribution of the softmax activation for each token.
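The two output activations and the two per-token losses can be written out in plain Python as a small sanity check. This is only an illustrative sketch and not the authors' training code: the function names are hypothetical, and the `eps` clamping is an added assumption to avoid taking the logarithm of zero.

```python
import math


def sigmoid(z):
    """Sigmoid activation for the single-logit (BCE) setting."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Softmax over the class logits of one token (LDL setting)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy between target y and prediction y_hat."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # assumption: clamp to avoid log(0)
    return -y * math.log(y_hat) - (1.0 - y) * math.log(1.0 - y_hat)


def kld(y_dist, y_hat_dist, eps=1e-12):
    """KL divergence between the target and predicted distributions."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(y_dist, y_hat_dist)
        if p > 0.0  # terms with zero target probability contribute nothing
    )
```

For a token with target distribution `[0.25, 0.75]`, the LDL loss would be `kld([0.25, 0.75], softmax(logits))`, while the BCE variant would use `bce(0.75, sigmoid(logit))`.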


We use the Adam optimizer for training the models, with a
learning rate of 1e-4 for the BiLSTM-ELMo model for 100
epochs and 2e-5 for the Transformer-based models for 100
epochs. The training was performed on 1 NVIDIA Titan X
GPU. Our code is available on GitHub⁵.


Results 


In Table 6 we present scores for both our BiLSTM-ELMo and
Transformers approaches, trained with both BCE Loss and
KL-Divergence Loss for LDL. As the results show, LDL as used
by Shirani et al. (2019) does not give a large improvement
and at times even lowers the scores.


Models | Dev | Test
BiLSTM-ELMo (POS) | |
BiLSTM-ELMo (POS) (LDL) | |
BiLSTM-ELMo (POS, Keyphrase) | |
BiLSTM-ELMo (POS, Keyphrase) (LDL) | |
XLNet | |
RoBERTa (LDL) | |


Table 6: Performance of the BiLSTM-ELMo and Transformers
approaches on the development and test sets. The results are
expressed in terms of average $\mathrm{Match}_m$ for
$m \in \{1, 5, 10\}$. LDL indicates that label distribution
learning was employed to train the model with KL-Divergence
as the loss function, Binary Cross Entropy otherwise. For the
BiLSTM-ELMo model, the extra features concatenated at the
attention layer are mentioned with each experiment. Baseline
indicates the scores of the baseline model defined by Shirani
et al. (2019).
For our final submissions, we tried an ensemble of scores
from different models, shown in Table 7. Our best scores on
the Evaluation leaderboard were obtained using an ensemble of
XLNet and RoBERTa with LDL, where we stood 3rd. Meanwhile,
our best scores on the Post-Evaluation leaderboard were
obtained using an ensemble of XLNet and the BiLSTM-ELMo
approach with POS tags and Keyphrase Feature, where we
currently stand 1st on the leaderboard.


Models | Dev | Test
XLNet + RoBERTa (LDL) | 0.547 | 0.5





Table 7: Performance of different ensemble models 
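One simple way to ensemble per-token scores is a weighted average of the emphasis probabilities predicted by each model. The sketch below assumes exactly that; the paper does not state how the scores were combined, so the averaging scheme, the uniform default weights and the function name are all assumptions for illustration.

```python
def ensemble_scores(model_scores, weights=None):
    """Weighted average of per-token emphasis scores from several models.

    model_scores: one list of per-token probabilities per model, all for
    the same slide. weights defaults to a uniform average (assumption).
    """
    n_models = len(model_scores)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_tokens = len(model_scores[0])
    if any(len(scores) != n_tokens for scores in model_scores):
        raise ValueError("all models must score the same tokens")
    return [
        sum(w * scores[t] for w, scores in zip(weights, model_scores))
        for t in range(n_tokens)
    ]
```

The averaged scores can then be ranked per slide in the same way as the scores of a single model.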


Additionally, we also ran experiments by dividing the pre- 
sentations into their constituent sentences in the train and 
development data. Thus each training instance now corre- 
sponds to a particular sentence belonging to a presentation 
slide in the original corpus. The development set results can 
be found in Table 8. The evaluation scheme used in this
experiment uses the same $\mathrm{Match}_m$ as described in
the Evaluation Metric section but with $m \in \{1, 2, 3, 4\}$,
as used in Shirani et al. (2020a).


Table 8: Sentence-wise results on the Development set 
Figure 5: Emphasis Heatmaps i) Ground Truth ii) BiLSTM-ELMo
iii) XLNet iv) Best Ensemble Model. Example text: "It is
extremely important that parents take time to SLOW DOWN and
give their child their undivided attention . The importance
of that can not be over-emphasized ."


Analysis 
Length vs Performance 


We wanted to understand how the performance of our models
was affected by the length of the instances. Table 9
summarizes the performance of our best performing single
model, i.e., XLNet, on the development set divided into three
sets: Short (< 40 tokens, 80 samples), Medium (40 to 90
tokens, 262 samples), and Long (> 90 tokens, 50 samples).
As we can see, the model performance deteriorates as the
length of the instances increases.
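The length analysis amounts to bucketing instances by token count and averaging a per-instance score within each bucket. A minimal sketch follows, using the thresholds stated in the text (fewer than 40 tokens, 40 to 90 tokens, more than 90 tokens); the function names and the idea of passing precomputed per-instance scores are assumptions for illustration.

```python
def length_bucket(n_tokens):
    """Assign an instance to a length bucket using the thresholds in the text."""
    if n_tokens < 40:
        return "short"
    if n_tokens <= 90:
        return "medium"
    return "long"


def bucketed_average(lengths, scores):
    """Average a per-instance score separately for each length bucket."""
    sums, counts = {}, {}
    for n_tokens, score in zip(lengths, scores):
        bucket = length_bucket(n_tokens)
        sums[bucket] = sums.get(bucket, 0.0) + score
        counts[bucket] = counts.get(bucket, 0) + 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}
```

Here `scores` would hold one evaluation score per development instance, and `lengths` the corresponding token counts.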


Instance length | XLNet
Small (< 40) | 0.648
Medium (>40 and < 90) | 0.549
Large (>90) |


Table 9: Average $\mathrm{Match}_m$ for the best performing
XLNet model on different sizes of instances in the
development set


Emphasis vs Parts of Speech 


Table 10 shows POS (Parts of Speech) tags vs. average
emphasis on the development dataset. We ran this experiment
to understand how our model predictions compared, for each
POS tag, with the human-annotated emphasis scores on the
development set. We noticed that the human average emphasis
scores were highest on Adjectives, followed by Nouns.
Comparing our models, we found that XLNet predicted the
emphasis scores on Adjectives and Nouns almost accurately,
and that BiLSTM-ELMo also gave its highest predictions to
Adjectives and Nouns, in that order. We also noticed that
XLNet did a better job of predicting the emphasis scores
across POS tags: its predictions were either very close to
the human scores or marginally lower. BiLSTM-ELMo's
predictions, on the other hand, fell short by bigger margins
than XLNet's, and it gave more emphasis to Adverbs than the
annotators did in the development set.
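A comparison of this kind can be computed by grouping tokens by POS tag and averaging the human and predicted emphasis scores within each group. The sketch below assumes three aligned per-token lists; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def average_emphasis_by_pos(pos_tags, human_scores, model_scores):
    """Average human and predicted emphasis per POS tag.

    Returns a dict mapping each POS tag to a tuple
    (average human score, average model score).
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for pos, human, predicted in zip(pos_tags, human_scores, model_scores):
        sums[pos][0] += human
        sums[pos][1] += predicted
        counts[pos] += 1
    return {
        pos: (totals[0] / counts[pos], totals[1] / counts[pos])
        for pos, totals in sums.items()
    }
```

Running it once per model gives the per-tag averages that can be placed side by side with the human averages.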


XLNet 
|} Noun | 


Human | Bi 

0.168 
0.083 0.173 
0.181 


0.042 
0.108 
0.022 
0.025 


Table 10: POS tags vs. average emphasis on development 
dataset 





Conclusion 


In this paper, we present our approach to the AAAI-CAD21
shared task: Predicting Emphasis in Presentation Slides. Our
best submission gave us an average $\mathrm{Match}_m$ of
0.518, placing us 3rd on the Evaluation phase leaderboard,
and an average $\mathrm{Match}_m$ of 0.543, placing us 1st on
the Post-Evaluation leaderboard at the time of writing the
paper. Future work includes using a hierarchical approach to
emphasis prediction as a sequence labeling task, using both
sentence-level (individual sentence in a slide) and
slide-level representations of a word (Luo, Xiao, and Zhao
2019).